1 

Capacity of DNA Data Embedding Under 
Substitution Mutations 

Felix Balado 
Abstract 

A number of methods have been proposed over the last decade for encoding information using 
deoxyribonucleic acid (DNA), giving rise to the emerging area of DNA data embedding. Since a DNA 
sequence is conceptually equivalent to a sequence of quaternary symbols (bases), DNA data embedding 
(diversely called DNA watermarking or DNA steganography) can be seen as a digital communications 
problem where channel errors are tantamount to mutations of DNA bases. Depending on the use of coding 
or noncoding DNA hosts, which, respectively, denote DNA segments that can or cannot be translated 
into proteins, DNA data embedding is essentially a problem of communications with or without side 
information at the encoder In this paper the Shannon capacity of DNA data embedding is obtained for 
the case in which DNA sequences are subject to substitution mutations modelled using the Kimura model 
from molecular evolution studies. Inferences are also drawn with respect to the biological implications 
of some of the results presented. 

I. Introduction 

The last ten years have witnessed the proposal of numerous practical methods HI, lH, 131, H, lH, 
m, Q, lO for encoding nongenetic information using DNA molecules as a medium both in vitro 
and in vivo. A conspicuous use of these techniques recently took place when Craig Venter's group 
produced the first artificial bacteria including "watermarked" information ifTOl . All of these information 
encoding proposals hinge on the fact that DNA molecules — which encode genetic information in all living 
organisms, except for some viruses — are conceptually equivalent to sequences of quaternary symbols. 
Therefore DNA data embedding is in essence an instance of digital communications in which channel 
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GCA AGA AAC GAC TGC CAA GAA GGA CAC ATA CTA AAA ATG TTC CCA AGC ACA TGG TAC GTA TAA 
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TABLE I 

Equivalences between amino acids and codons (genetic code). Start codons, which double as regular 

codons, are underlined. 



errors are tantamount to mutations of DNA components. The two broad fields of application of DNA 
data embedding techniques are: 1) the use of DNA strands as self -replicating nano-memories able to 
store huge amounts of data in an ultra-compact way; and 2) security and tracking applications for genetic 
material afforded by embedding nongenetic information in DNA (DNA watermarking, steganography, 
and fingerprinting). 

The most basic information theoretical issue in DNA data embedding is the establishment of the upper 
limit on the amount of information that can be reliably embedded within DNA under a given level of 
mutations, that is, its Shannon capacity [11]. In this paper we obtain the capacity of DNA data embed- 
ding under substitution mutations — which randomly switch the value of bases in a DNA sequence — 
modelled through a symmetric memoryless channel which was firstly used to study molecular evolution 
by Kimura |[T2l . The capacity problem can be straightforwardly tackled when no side information is used 
by the encoder. The side-informed scenario requires more attention for reasons that will become clear 
later, and thus occupies us for the best part of this paper. Some biological implications at large of these 
information theoretical results are also discussed. In particular, the non side-informed scenario happens 
to be closely connected to previous studies by May, Battail, and other authors that have tried to apply 
information theoretical concepts to molecular biology. 

II. Preliminary Concepts and Assumptions 

Chemically, DNA is formed by two backbone strands helicoidally twisted around each other, and 
mutually attached by means of two base sequences. The four possible bases are the molecules adenine, 
cytosine, thymine, and guanine, abbreviated A, C, T and G, respectively. Only the pairings A-T and C-G 
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can exist between the two strands, which is why each of the two base sequences is completely determined 
by the other, and also why the length of a DNA molecule is measured in base pairs (bp). According to this 
brief description, the interpretation of DNA as a one-dimensional discrete digital signal is straightforward: 
any of the two strands constitutes a digital sequence formed by symbols from a quaternary alphabet. 

As regards the biological meaning of DNA, for the purposes of our analysis it suffices to know that 
codons — the minimal biological "codewords" — are formed by triplets of consecutive bases in a base 
sequence. Given any three consecutive bases there is no ambiguity in the codon they stand for, since there 
is only one direction in which a base sequence can be read. In molecular biology this is called the 5 '-3' 
direction, in reference to certain chemical feature points in a DNA backbone strand. The two strands in 
a DNA molecule are read in opposite directions, and because of this and of their complementarity they 
are termed antiparallel. Groups of consecutive codons in some special regions of a DNA sequence can be 
translated into a series of chemical compounds called amino acids via transcription to the intermediary 
ribonucleic acid (RNA) molecule. RNA is similar to DNA but single stranded and with uracil (abbreviated 
U) replacing thymine. Amino acids are sequentially assembled in the same order imposed by the codon 
sequence. The result of this assembling process are proteins, which are the basic compounds of the 
chemistry of life. There are 4^ = 64 possible codons, since they are triplets of 4-ary symbols. Crucially, 
there are only 20 possible amino acids, mapped to the 64 codons according to the so-called genetic code 
in Table Jl which will be explained in more detail later. The genetic code effectively implements built-in 
redundancy in terms of protecting protein expression. 

The genome of an organism is the ensemble of all its DNA. Segments of a genome that can be 
translated into proteins through the process described above are called coding DNA (cDNA), whereas 
those segments that never get translated are called noncoding DNA (ncDNA). A gene is a cDNA segment, 
or group of segments, which encodes one single protein, and which is flanked by certain start and stop 
codons (see Table ^ plus other markers. 

Finally, for each base sequence there are three different reading frames which determine three different 
codon sequences. The correct reading frame is marked by the position of a start codon. 

The main assumptions that we will make in our analysis are the following ones: 

• ncDNA can be freely appended or overwritten. Although ncDNA does not encode genes, this 
assumption does not always hold true. This is because certain ncDNA regions act as promoters 
for gene expression, or are transcribed into regulatory RNA (but not translated into proteins). 
However this working hypothesis is valid in suitably chosen ncDNA regions, as proved by several 
researchers H, Q employing live organisms. 
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• cDNA can be freely modified as long as the genetic code is observed. This is the classic standard 
assumption supporting the validity of the genetic code. In practice living organisms feature preferred 
codon statistics, which, if modified, might alter gene expression (for instance translation times, 
among other effects). Therefore we will also discuss codon statistics preservation in our analysis. 
Finally we must mention that we will only consider nonoverlapping genes (either on the same or 
on opposite strands). However overlapping genes are in any case rare occurrences, except in very 
compact genomes. 

Notation. Calligraphic letters {X) denote sets; \X\ is the cardinality of X. Boldface letters (x) denote 
row vectors, and 1 is an all-ones vector. If a Roman letter is used both in uppercase {X) and lowercase (x), 
the two forms denote a random variable and a realisation of it, respectively. p{X = x) is the probability 
mass function (pmf) of X; we will simply write p{x) when the variable is clear from the context. E[X] 
is the mathematical expectation of X, and H{X) its entropy. Also, h{q) is the entropy of a Bernoulli(g) 
random variable. I{X;Y) is the mutual information between X and Y. Logarithms are base 2, unless 
explicitly indicated otherwise. The Hamming distance between vectors x and y is denoted by (i//(x, y). 

A ncDNA sequence will be denoted by a vector x'' = [xi, 2:2, • • • , Xn], whose elements are consecutive 
bases from a base sequence. That is, Xi ^ X = {A, C, T, G}, the 4-ary set of possible bases. A cDNA 
sequence will be denoted by a vector of vectors x^ = [xi , X2 , • • • , x„] whose elements are consecutive 
codons from one of the two antiparallel base sequences, assuming a suitable reading frame among the 
three possible ones. Therefore, Xj G X^. We denote by x'^ = a(xj) G X' the amino acid into which 
a codon Xj uniquely translates, which is further discussed below. Also x' = a(x'^) = [x'^,X2, • • • , x^] 
denotes the unique amino acid sequence established by x'^, usually called the primary structure. Using the 
standard three-letter abbreviations of the amino acid names, we define the set X' = {Ala, Arg, Asn, Asp, 
Cys, Gin, Glu, Gly, His, He, Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr, Val, Stp}. The subset of codons 
associated with amino acid x' € X' , that is, Sx' = {x G Af^|a(x) = x'}, is established by the genetic 
code shown in Table Jl The ensemble of stop codons, that is, the stop symbol Stp, is loosely classed as an 
"amino acid" for notational convenience, although it does not actually map to any compound but rather 
indicates the end of a gene. We call the number of codons \Sx' \ mapping to amino acid x' the multiplicity 
of x'. Due to the uniqueness of the mapping from codons to amino acids, see that Sx- H Syi = for 
x' 7^ y' G X' , and that Ylix'&X' \Sx'\ = = 64 since [Jx'&X'Sx' = X^. Finally, an example of a cDNA 
sequence may be for instance x^ = [[T, A, T], [T, G, C]], which would encode the amino acid sequence 
x' = a(x'^) = [Tyr, Cys]. The corresponding base sequence would be x'' = [T, A, T, T, G, C]. 
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A. Mutation channel model 

As mentioned in the introduction, an information-carrying DNA molecule undergoing mutations can 
be readily seen as a digital signal undergoing a noisy communications channel, which we may term 
"mutation channel" in this context. We will only consider herein substitution mutations (also called 
point mutations), that is, those that randomly switch letters from the DNA alphabet. We will assume that 
mutations are mutually independent, which is a worst-case scenario in terms of capacity. Therefore we are 
assuming that the channel is memoryless and thus accepts a single-letter characterisation; consequently, 
we will drop vector element subindices whenever this is unambiguous and notationally convenient. 

We will model the channel by means of the two-parameter Kimura model of nucleotide substitu- 
tion |[T2l . This consists of a 4 x 4 transition probability matrix 11 = \p{Z = z\Y = y)], where z,y G X, 

and which presents the following structure: 

A C T G 

1-q Iq Iq [1 - ^-)q 

_(l-f)<Z ^q i<7 l-q 

From this definition, the probability of base substitution mutation, or base substitution mutation rate, is 

q=p{Z^y\Y = y) = Y,P{Z = z\Y = y), (2) 

for any y £ X, whereas it must hold that 0<7<3/2so that row probabilities add up to one. The 
particular structure of 11 aims at reflecting the fact that DNA bases belong to one of two categories 
according their chemical structure: purines, TZ = {A, G}, or pyrimidines, 3^ = {C,T}. There are two 
types of base substitutions associated to these categories, which in biological nomenclature are: 

• Base transitions: those that preserve the category which the base belongs to. In this case the model 
establishes that p{Z = z\Y = y) = {1 — 2^/3)q forz^y when either both z,y G 7^ or both 

• Base transversions: those that switch the base category. In this case the model establishes that 
p{Z = z\Y = y) = (7/3)9 forzy^y when z £ y and z G 7^, or vice versa. 

The channel model ([T]) can incorporate any given transition/transversion ratio e by setting 7 = 3/(2(e + 
1)). Estimates of e given in |[T3l for the DNA of different organisms range between 0.89 and 18.67, 
corresponding to 7 between 0.07 and 0.79. This range of e reflects the fact that base transitions are 
generally much more likely than base transversions due to the chemical similarity among compounds in 



A 
C 
T 
G 



(1) 
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the same category, that is, e > 1/2 virtually always in every organism, and therefore 7 < 1. However 
many mutation estimation studies focus only on the determination of q (see for instance |[T4ll ). and then 
one may assume the simplification 7 = 1 in the absence of further details. We will make observations at 
several points for this particular case, which is known as the Jukes-Cantor model in molecular evolution 
studies. In this situation all off-diagonal entries of 11 are equal, that is, p{Z = z\Y = y) = q/3 for all 

Note that the mutation model that we have chosen implies a symmetric channel, since all rows (columns) 
of n contain the same four probabilities. Among the memoryless models used in molecular evolution, the 
Kimura model is the one with higher number of parameters which still yields a symmetric channel. As 
it is well known, this is advantageous in capacity computations and will be exploited whenever possible. 
In the most general case a time-reversible substitution mutations model may have up to 9 independent 
parameters, and yield a nonsymmetric channel. However according to Li |[T5l mutation models with many 
parameters are not necessarily accurate, due to the estimation issues involved. 



Under m cascaded mutation stages we have a Markov chain Y ^ Z, 



(1) 



(2) 



^(m) 



, and 



model ([T]) leads to the overall transition probability matrix between Y and . As 11 = 11^ we can 
write n™ = VD™ V^, with the eigenvalues of 11 arranged in a diagonal matrix D = diag(l, A, /x, fi), 
where 



A = 1 1 



1-2 1 



(3) 
(4) 



and V a matrix whose columns are the normalised eigenvectors of 11 associated to the corresponding 
eigenvalues in D, that is 



v4 



(5) 



1 1 -V2 
1 -1 -V2 
1-1 \/2 
1 1 y/2 

From the diagonalisation of IT" it is straightforward to see that the elements of its diagonal all take the 
value 1(1 + 2fi"^ + A"*), the elements of its skew diagonal take the value 5 (1 — 2/2"^ + A™), and the rest 
of its entries are ^ (1 — A™). Therefore any row (column) of this matrix contains the same probabilities, 
as is also the transition matrix of a symmetric channel. From the diagonal elements one can see that 
the accumulated base substitution mutation rate after m cascaded stages is given by 



(m) 



1 



p{Z^m) / y|r = y) = 1 - - (1 + 2/.™ + A™) . 



(6) 
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When q > 0, limm-^oo Q^'^^\'y>o = 3/4 but liuim^oo Q^'^^ \-y=o = 1/2, because \fj,\ < 1 for any 7 
and |A| < 1 when 7 > 0, but A = 1 when 7 = 0. The behaviour of this particular case is connected 
to the fact that we must have both q G (0,1] and 7 G (0,3/2] for the Markov chain to be aperiodic 
and irreducible, and thus possess a limiting stationary distribution. From the previous considerations, 
the limiting distribution — that is, the distribution of Z(^^-^ — is uniform, because limm->oon'" = 
When 7 = 1 and g = 3/4 then 11 = jl^l, and hence every Z(^) is uniformly distributed as in the 
limiting case. 

Lastly, under the base substitution mutation model that we are considering, codons undergo a mutation 
channel modelled by the 64 x 64 transition probability matrix 

n = [p(z = z|Y = y)] = n®n®n, (7) 

where (8> is the Kronecker product. This is because p{Z = z\Y = y) = n?=i = Zi\Y = y^) according 
to our memoryless channel assumption. Trivially this channel is also symmetric. When m mutation stages 
are considered, replaces n in ©, since n™ = (n ® n ® n)™ = n*" O H"* (g) n™ IH. 

III. Capacity Analysis 

A. Noncoding DNA 

We will firstly consider this simple case, which will also establish a basic upper bound to cDNA 
capacity. As per our discussion in Section |lll we are assuming that a embedder can overwrite or append 
a host ncDNA strand x'', which amounts to freely choosing the input to the mutation channel. Therefore 
in this case the channel capacity is given by Cnc — max/(Z(^);y) bits/base, where the maximisation 
is over all distributions of Y. For the mutation model considered, this capacity is that of the symmetric 
channel, in which i7(Z(f„)|y) is independent of the input and uniformly distributed Y leads to uniformly 



distributed Z(j^y Hence 



Cnc = log \X\ - H{Z^^)\Y) bits/base, (8) 



where H{Z^^)\Y) = J^zexPi^M = z\Y = ?/)logp(Z(^) = z\Y = y) for any y e X, that is, the 
entropy of any row of IT™. Therefore 

H{Z(^)\Y) = -i(l + 2^- + A-)log Q(1 + 2m'" + A-)) 
_i (1 _ 2^ + A-) log Q (1 - 2^ + 

-i(l-A™)logfi(l-A")V (9) 
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As long as the Markov chain is aperiodic and irreducible then lim.m-j.oo C^c = 0. The reason is that 
since the limiting distribution is independent of Y, then limm-i-oo H{Z(^jn)\Y) = H{Z(^oo)) = log It 
is interesting to note that, under aperiodicity and irreducibility of the Markov chain, this zero limiting 
capacity will also apply to models more involved than ([T]), such as those in which the channel matrix is 
parametrised by up to 9 independent values. Lastly, we also have that (711017=1,5=3/4 = 0, since in this 
case Z(^) is always uniformly distributed. 

As a function of 7 the ncDNA capacity is bounded as follows 

(711017=1 ^ Cnc ^ C'iic|7=0- (10) 

Although it can be shown with some effort that these inequalities always hold true, it is much simpler to 
prove them for the range of interest 7 < 1 and q < 1/2. The latter condition implies that both < A < 1 
and < ^ < 1. For fixed m and q, the maximum (respectively, minimum) of C^c over 7 corresponds to 
the minimum (respectively, maximum) of the accumulated base mutation rate q^"^\ Differentiating Q we 
obtain dq^^^^'/dj = {mq/3) (A'""'^ — fi"^~^). Therefore q^"^^ is monotonically increasing when 7 < 1 
(as this corresponds to A > /i), and then its maximum in that range occurs when 7=1 and its minimum 
when 7 = 0. 

The upper bound can be written as 

Cnc|7=0 = 2 - /l Q + ^(1 - 2(/)™j . (11) 

Notice that limm^-oo C'ncl7=o = 1> that is, the capacity limit is not zero when 7 = because then the 
Markov chain is reducible. This case cannot happen in practice since it would imply that transversion 
mutations are impossible, but it illustrates that the higher the transition/transversion ratio e, the higher 
the capacity. 

Figures [T] and |2] show C^c for two different values of q representative of extreme values of the base 
substitution mutation range q per replication found in different living beings and different sections of 
genomes |[T4l . We observe the validity of the bounds (ITOl ) and the limiting behaviours discussed. From 
these figures we can also empirically see that a rule-of-thumb capacity cut-off point is given by m ~ 
6/(57(7). 

a ) Biological interpretations: We would like to point out that expression ([8]) also gives the maximum 
mutual information between a DNA strand and its mutated version in natural scenarios, independent of 
DNA data embedding procedures. This has sometimes been termed the capacity of the genetic channel 
in studies applying information theory to molecular biology. Several authors have used the Jukes-Cantor 



January 19, 2011 



DRAFT 



9 



Cascaded mutation stages (m) 



Fig. 1. Embedding capacity in ncDNA (g = 10 ) 
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Cascaded mutation stages (m) 
Fig. 2. Embedding capacity in ncDNA (g = 10"^) 



model and particular cases of the Kimura model — apparently unaware of the prior use of these models 
in molecular evolution studies — in order to estimate this capacity. The case m = 1 was numerically 
evaluated by May et al. ifTTl . using values of q estimated from different organisms and the Jukes-Cantor 
and Kimura (7 = 1/2) models. Some authors have also considered the behaviour of capacity under 
cascaded mutation stages, that is, m > 1. Gutfraind lITSl discussed the basic effect of cascaded mutations 
on capacity (exponential decrease with m), although using a binary alphabet and the binary symmetric 
channel. Both Battail |iT9l and May |[20l computed capacity under cascaded mutation stages using a 
quaternary alphabet and the Jukes-Cantor model. The first author obtained his results analytically — but 
using a continuous-time approach rather than the discrete-time approach followed here — and the second 
one numerically. The results by Battail are essentially consistent with the ones presented here (similar 
capacity cut-off point), but the ones by May are not. The capacity plots in |[20l (taking the results for 
the human genome) show a cut-off point of m 10^ for q « 10~^, whereas m ^ 10^ would have been 
expected according to Figure |2] Considering the extents of geological time, where m can easily reach 
10^ and beyond, it seems clear that the results in |[20l underestimate capacity for m > 1. 

In any case, none of the aforementioned approaches reflects the capacity increase afforded by a mutation 
model allowing 7 < 1. It is possible that the trend towards higher capacity observed as 7 — implies that 
evolution has favoured genetic building blocks which feature an asymmetric behaviour under mutations 
(in our case, pyrimidines versus purines instead of a hypothetically perfectly symmetric set of four bases 
for which 7 = 1). If this assumption is correct, this symmetry breaking must have occurred early in 
evolutionary terms, since it is widely believed that the current genetic machinery evolved from a former 



January 19, 2011 



DRAFT 



10 



"RNA world" ll2ll in which life would only have been based on the self-replicating and catalysing 
properties of RNA. In the RNA world there would not have been translation to proteins, and therefore no 
genetic code, and hence information was freely encoded using a 4-ary alphabet almost exactly like the one 
used in DNA. Note that uracil, which replaces thymine in RNA, is also a pyrimidine, that is, in the RNA 
world y = {U,C}. With these facts in mind, we may model the maximum transmissible information 
under mutations in the RNA world by relying on ([8]l, and thus see that the symmetry breaking conjecture 
above applies to the evolution of RNA from predecessor genetic building blocks. We must bear in mind 
that single-stranded molecules, such as RNA, are much more mutation-prone than double-stranded ones 
such as Therefore smaller values of m would have sufficed for some type of symmetry breaking 

to be relevant in terms of information transmission at early stages of life. 

B. Coding DNA 

Unlike in the ncDNA case, embedding information in cDNA is a problem of coding with side 
information at the encoder. Given a host sequence x*^, the encoder has to modify this host to produce an 
information-carrying sequence y'^ which must also encode the same primary structure as x'^ according 
to the genetic code. This is equivalent to hiding data in a discrete host under an embedding constraint. 
Nevertheless, apart from the trivial difference of using a 4-ary instead of typically a 2-ary alphabet, several 
issues set apart cDNA data embedding as a special problem. In order to illustrate these issues consider 
momentarily a typical data hiding scenario in which a discrete binary host, that is x = [xi,--- ,Xn] 
with Xi ^ X = {0, 1}, is modified to embed a message m from a certain alphabet. The watermarked 
signal y = e(x, in) must be close to x, where closeness is usually measured by means of the Hamming 
distance (i//(y,x). Pradhan et al. |[23ll and Barron et al. |[24l have determined the achievable rate in 
this scenario, assuming that the elements of X are uniformly distributed, using the average distortion 
constraint ^E[dH(y, X)] < d, and supposing that y undergoes a memoryless binary symmetric channel 
with crossover probability q. Their result is 

i^u^^if = u.c.e.{/i(d) - h{q)} bits/host symbol, 

where u.c.e{-} is the upper concave envelope. Similarly, our initial goal for cDNA data embedding is 
obtaining the achievable rate for a fixed distribution of X' = a(X) under the symmetric channel discussed 
in Section III-AI in particular when X is uniformly distributed as in the analyses of Pradhan et al. ll23l 

'For instance, RNA viruses such as HIV are known to exhibit base mutation rates of up to 10"'^ per year II22I . 
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and Barron et al. 1241 . Furthermore we will also obtain capacity, that is, the maximum achievable rate 
over all distributions of the host X' . 

The first important difference in the cDNA data embedding scenario is that average inequality con- 
straints on the Hamming distance — such as the ones used in |[23l . |[24l — are meaningless if one wants 
to carry through to y*^ the full biological functionality of x'^. Instead, since it must always hold that 
"^(y^) = a(x'^), one must establish the deterministic constraint 

n 

dH{y',^) = Y,dH{y'i,x[) = Q. (12) 

i=l 

This requires that x') = for alH = 1, • • • , n. 

The second distinguishing feature of cDNA data embedding is due to the variable support of the 
channel input variable. Whereas in discrete data hiding with binary host one always has that yi G {0, 1} 
independently of Xj, in cDNA data embedding we have that yj G '^^(x;) so that the constraint ([T2l ) can 
always be satisfied. Therefore the support of y^ is dependent on Xj, as codon equivalence is not evenly 
spread over the ensemble of amino acids (see Table J]). 

1 ) Achievable Rate: Since side information at the encoder must be taken into account in the cDNA 
case, then the achievable rate is given by Gel'fand and Pinsker's formula {25l = max/(Z(„); U) — 
/(X';U) bits/codon, where the maximisation is for nonnegative values of the functional on all distri- 
butions p{y,u\x') under the constraint d//(a(y),x') = 0, with U an auxiliary random variable that we 
will discuss next. Note that R'^ represents the maximum achievable rate when the host cDNA amino 
acid sequence is distributed as X'. 

Gel'fand and Pinsker showed in |[25l that in the maximisation problem above one may assume that 
the channel input is a deterministic function of the side information X' and the auxiliary variable U, 
that is, Y = e{X', U). Since the support of Y\x' must be the set of codons Sx' corresponding to amino 
acid x' — so that the biological constraint can always be satisfied — then the cardinality of the support 
of U|x' has to coincide with the multiplicity of x', that is, \Sx'\- The support of U|x' must actually be 
Sx', because U must also act as a good source code for X' in order to minimise I{X';\J) under the 
genetic constraint, and if the support of U|a;' is otherwise then the constraint cannot always be met. One 
can now establish Y\x' = \J\x' without loss of generality, although any permutation of the elements 
of Sx> is actually valid to define Y\x' = e(x',U). Therefore in the following one may consider that 
Y = U, that is, that U is the mutation channel input. Noticing that Sx' H Sy' = for x' 7^ y' G X', the 
distribution of U can be put as p{u) = p{u\x')p{x') when u G Sx'- This discussion on U also implies 
that H{X'\U) = 0, since given a codon u there is no uncertainty on the amino acid represented, and 
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therefore I{X';V) = H{X') 

Since Y|(x',u) is deterministic, from the considerations above we have that the achievable rate for a 
fixed distribution of X' is given by 

Rf = max /(Z(„);U) - H{X') bits/codon. (13) 

As //(Z(,;„)|U) only depends on the transition probabilities of the symmetric channel, and as trivially 
H{X') only depends on X' , (fT3l ) amounts to the constrained maximisation of H{Z(^^)). 

There are several cases in which ([T3] ) can be analytically determined, which are discussed next. First of 
all, since Cnc | '1=1,5=3/4 — then R^ |^_]^ ^—3/4 — for any X\ because R^ < SC^ic- Therefore in this 
catastrophic case the choice of p{u\x') is irrelevant. Furthermore it can be shown that p{u\x') = l/\Sx'\, 
that is, U|x' uniformly distributed, is the maximising strategy in two situations, which are discussed in 
the following lemmas. 

Lemma 1. If q = then the achievable rate is 

Rc'\q=Q = E\iog\Sx'W bits/codon. (14) 

Proof: Using the chain rule of the entropy we can write i?(U, X') = F(U) + F(U|X') = H{X') + 
H{X'\\]). As H{X'\\J) = 0, and as Z(^) = U when (7 = 0, then the achievable rate is given by 
Rc '\q=o = maxp(u|a;') H{\J)-H{X') = maxp(u|^,,) H{\]\X'). We just need to see now that H{JJ\X') = 
p(x')ff(U|x') is maximised when H([J\x') is maximum for all x' , which implies that U|x' be 
uniformly distributed in all cases. Then H{\J\x') = log \Sx'\ and ([141 ) follows. ■ 
Remark. Note that ([141) is the embedding rate intuitively expected in the mutation-free case. For example, 
if X were uniformly distributed, which would yield X' = a{'K) nonuniform with pmf p{x') = |5a;'|/|A'p, 
then we would obviously compute the rate as i?" ''™'^^|(j=o = Yl,x' l*^^'! ~ 1-7819 bits/codon, since 

\Sx'\ choices are available to the embedder when the host amino acid is x' . The rate in the uniform case 
can actually be obtained in closed form for every q using the following result. 

Lemma 2. Tjf X is uniformly distributed then the achievable rate is 

^a(unif) ^ _ jj^^r^ bits/codon, (15) 

where Cnc — max/(Z(^); U) and this maximisation is unconstrained on p(u), that is, Cnc is the capacity 
of the symmetric codon mutation channel. 

Proof: Since p{u) = p{u\x')p{x') when u G Sx', with uniformly distributed X we have that 
p{u) = p{u\x')\Sx'\/\X\^ when u G Sx'- Therefore choosing U|x' to be uniformly distributed implies 
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that p(u) = 1/1 A"!^ for all u. Since 11™ is symmetric and a uniform input maximises mutual information 
over a symmetric channel, then Cnc = max/(Z(m); U) is achieved in (fT3l ). ■ 
Remarks. Since Cnc\q=o = logj^Yp, observe that the particular case in the previous remark can be 
written as well as Rc^""^^^\g^Q = log\X\^ — H{X'). An interesting insight is also afforded by seeing 
that the three parallel symmetric channels undergone by the bases in a codon are mutually independent, 
and hence one can use the equality Cnc = 3 Cnc in (fTSl) . As H{X') is the lower bound to the lossless 
source coding rate of X' , expression ([TS] ) tells a fact that is intuitively appealing but which is only exact 
when X is uniform: the cDNA embedding rate is the same as three times the ncDNA embedding rate 
minus the rate needed to losslessly convey the primary structure of the host to the decoder. 

Unlike in the case considered above, the distribution of X in real cDNA sequences (that is, genes) is 
not uniform. To start with, there can only be a single Stp codon in a sequence that encodes a protein 
(gene). As in many other channel capacity problems, it does not seem possible in general to analytically 
derive the optimum set of pmf's p(u|x') in order to compute the achievable rate R^' corresponding 
to a host distributed as X' . To see why one can pose the analytical optimisation problem and see that 
it involves solving a nontrivial system of + nonlinear equations and unknowns. However the 
numerical solution is straightforward by means of the Blahut-Arimoto algorithm |[26l adapted to the 
side-informed scenario. Such an algorithm has been described by Dupuis et al. |[27l . An example of the 
optimal distributions numerically obtained for a particular case is shown in Figure [3] 

A last observation is that, in general, p{\i\x') = turns out to yield a good approximation to 

the exact numerical solution. Note that for distributions of X' different from the one in Lemma |2] one 
cannot produce a uniform input U to the symmetric channel in order to generate a uniform output Z(„). 
This is the case illustrated in Figure [3l nonetheless, note from this figure that p{\i\x') does not differ 
excessively from a uniform distribution for several amino acids x' . A justification of this behaviour is as 
follows. Using the fact that conditioning cannot increase entropy, a suboptimal maximisation approach 
is given by maximising the lower bound //(Z(„^)|X') = Xlx'eA" P(^')-^(2{m)|2;') ^ ^(Z(m))- This 
requires maximising //(Z(„^)|2;') = — ^^^^3 p(z|x') logp(z|x') for all x' . Observing from Table U that 
codons mapping to the same amino acid share in many cases up to two bases, we can approximate 
p{z\x') « when z ^ Sx' and p(z|x') « ^X]ve5 / Pi'A^)^ Ylues , Pi^\^)pi^W) when z e S^'- With 
this approximation, whenever X]ue5 / ^(^l'^) constant for all z € Sx', choosing p{\i\x') to be uniform 
implies that p(z|x') is also uniform, which maximises //(Z(„)|x'). It can be verified that this condition 
holds for all x' such that \Sx'\ = 1,2,4, which accounts for 16 out of the 21 elements in X' . 

Figures |4]|7] present the achievable rates for several distributions of X' . Shown are the rates for the 



January 19, 2011 



DRAFT 



14 



1 

0.9 
0.8 
0.7 

^ 0-6 
~^ 0.5 
0.4 
0.3 
0.2 
0.1 




OJJOo CDCXDCja^ 
CDCX^CJ «OOOo C?^ l-l- 



<S3 «3 <t>-p 



Fig. 3. Example of maximising p{u\x') distributions numerically obtained using the Blahut-Arimoto algorithm and p(x') 
corresponding to gene Ypt7 from yeast (GenBank accession number NC_001145), employing 7 = 0.1, q = 10"^, m = 10. 
Conditional pmf 's are depicted in alternating red and blue colours to facilitate plot reading. 



distributions corresponding to two real genes: Ypt7 {S. Cerevisiae) and FtsZ {B. Subtilis), whose GenBank 
accession numbers are NC_001145 and NC_000964, respectively; also depicted are the rate (fTSl ) for X 
uniform and the rate for the deterministic distribution of X' with outcome Ser, which, as we will discuss 
in Section ITlI-B 2 [ yields capacity. We observe in these plots that there is barely a difference in the results 
obtained with the Blahut-Arimoto algorithm and the uniform approximation to p(u|x') that we have 
discussed. 

a) Codon statistics preservation: If we require that the original codon statistics of the host are 
preserved in the information-carrying sequence, then we must peg p{u\x') to the corresponding distri- 
bution of the host. Therefore in this case no maximisation is required, and the corresponding rate will 
be lower or equal than the one achieved without codon statistics preservation. This type of constraint is 
equivalent to a steganographic constraint in data hiding, since the pmf of the host is preserved in the 
information-carrying sequence. A comparison of maximum rates and codon statistics preservation rates 
for the same genes as before is given in Figure [8] Note that in the uniform case both rates coincide 
because of Lemma |2] 
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Fig. 4. Embedding rate in cDNA for different distributions of Fig. 5. Embedding rate in cDNA for different distributions of 

X' (7 = 1, Q = 10-2) X' (7=l,g=10-^) 




Cascaded mutation stages (m) 
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Fig. 6. Embedding rate in cDNA for different distributions of Fig. 7. Embedding rate in cDNA for different distributions of 
X' (7 = 0.1, q = 10-2) X' (7 = 0.1, q = IQ-^) 



2) Capacity: It remains the computation of capacity, that is, 

Cc = max bits/codon. (16) 

p{x') 

It is simple to explicitly obtain Cc in two particular cases discussed in Section IIII-Bll 

• 7 = 1 and q = 3/4: obviously, Cc|^=i ^=3/4 = i?^' ^=3/4 = 0. However, the point that has to be 
made here is that, although this is true for any X' , only deterministic X' yields H(X') = exactly, 
and hence, by the continuity of the rate functional, this will be the best strategy when approaching 
(/ = 3/4 from the left. 
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Cascaded mutation stages (to) 

Fig. 8. Comparison of cDNA embedding rate witli and witliout codon statistics preservation constraints, for different distributions 
of X' (7 = 0.1, Q = 10"^) 

• q = 0: From Lemma [H Cc\q=o = maxy/ log \Sy'\. Since \Sx'\ = 6 is maximum for all x' G W' = 
{Ser, Leu, Arg}, a distribution of X' that maximises (fT4l ) is any for which Xlz'eW ^^(^0 ~ ^- Note 
that X' needs not be deterministic. Capacity is then Cc|g=o = log 6 = 2.5850 bits/codon. 

Remark. A trivial upper bound for any q is Cc < Cc|g=o- Since Cc|g=o < 3Cnc|<j=o = 6, then 
side-informed cDNA data embedding capacity will not be able to achieve non-side-informed ncDNA 
capacity for every mutation rate. This is similar to parallel results in side-informed encoding with discrete 
hosts |[23l . |[24l (for uniform side information), and unlike the well-known result by Costa for continuous 
Gaussian hosts |[28l . 

From our previous discussion on the value of Cc for two particular cases one may conjecture that a 
pmf with support in W' may be capacity-achieving. The actual capacity-achieving strategy is given by 
the following theorem: 

Theorem 1. Capacity is achieved by the deterministic pmf of X' that maximises H{7ji^^-^). 

Proof: See Appendix lAl ■ 
Remarks. Denoting as ^' the deterministic outcome of X' , it can be numerically verified that ^' = Ser 
maximises i7(Z(„^)) for all 7 < 1, m, and q, and thus Cc = R^'^'^ in these conditions. Some examples of 
the rates achievable with deterministic X' are shown in Figures l9l[T2l These figures show that the rates 
using the linearised approximation given in Appendix |A] are practically indistinguishable from ones using 
the Blahut-Arimoto algorithm, whereas the approximation p(u|x') = is also good but worsens as 
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7 decreases. 

a) Biological interpretations: The results in this section solely concern artificial embedding of 
information in cDNA, and thus seem to have less obvious applicability in biological terms than the ones 
concerning ncDNA. However an intriguing phenomenon which somehow indicates a biological connection 
of these results can be observed in Figures [TT]fT2] which concern achievable rates for sequences encoding 
a single amino acid/symbol. The effect is observed for 7 < 1 — that is, the range of 7 in which the 
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model is more realistic — and consists of a rate droop for two particular values of as m — )• oo with 
respect to all other symbols presenting the same multiplicity. The particularity is that these two values of 
^' correspond to the stop symbol (Stp) and the amino acid Leu, which happens to double as start codon 
in prokaryotes. Therefore all stop codons and most of the start codons seem to be less suited to carrying 
extra information (redundancy) when isolated. It is not obvious how to interpret this effect, but one may 
surmise that these codons might have suffered some type of selective pressure during the emergence of 
the genetic code which somehow depended on the information theoretical amount studied here. In the 
case of the stop symbol, this may be due to the fact that it can only appear once per gene, and so it 
makes for a bad conveyor of extra information beyond its basic function. 

IV. Conclusions 

We have provided an analysis of the embedding capacity of DNA when mutations are modelled accord- 
ing to the Kimura model from molecular evolution studies, and discussed some biological connections 
of these results. A more thorough study would require considering insertion and deletion mutations 
(indels). Although the exact computation of capacity under indels is an unsolved problem in most digital 
communications scenarios, some approximations relying on realignment methods from bioinformatics 
might suffice in this context. Generalisations of the Kimura model may also be considered. Although 
in general they will lead to nonsymmetric channels, these can be numerically handled using the Blahut- 
Arimoto algorithm. 

Appendix 

A. Capacity-achieving strategy p{x') 

In order to find the capacity-achieving strategy we need to solve 

d 



(17) 



dp{x') 

for x' G X', with v a Lagrange multiplier. In the following we will write p{z\x') = p(Z(^) = z\X' = 
x') for notational convenience. Assuming natural logarithms for simplicity, and using dp{z)/dp{x') = 
p{z\x'), (ITtI ) becomes 

^ p{z\x')log ^ p{y')p{z\y') = logp{x') + (18) 

for x' G X' . The solution remains unchanged if we multiply ([TS] ) across by p{x'). This allows us to see 
by inspection that any extreme of the Lagrangian in ([TT] ) has to be deterministic, that is, p{x') = 1 for 
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some x' = ^' and p(x') = for x' / ^ . Note that this is in agreement with the strategies for the cases 
q = ^ and 7 = 1 with g = 3/4 discussed in Section IIII-B2I See for instance that a uniform distribution 
of X' cannot possibly solve ([T8] ) for all x' G X' , because X]z^'(^l^') i^y'V^'Av')^ i^ot constant 
on x' unless 7=1 and (7 = 3/4, in which case we have shown that capacity is zero for any distribution. 

According to the previous discussion, for any capacity-achieving solution it always holds that H{X') = 
0, and then we just have to maximise //(Z(,„)) over the ensemble of 21 deterministic distributions of 
X'. 

The computation of Rc and of the maximising distribution U|^' can be done using the Blahut-Arimoto 
algorithm, following the discussion in Section Ull-Bll on the optimal strategy for fixed p{x'). Note that 
^' = Trp and ^' = Met can be ruled out outright, since |5Trp| = I^Motl = 1» and then only null rates 
are possible in these cases. Then we only need to compute i?f for 19 amino acids. Also, ^' = Stp can 
only be considered hypothetically, since this symbol can only appear exactly once in a gene. 

a ) Approximation to maximising strategy: It is also possible to provide a closed-form approximation 
to the maximising distribution U|^', which yields a better approximation to the embedding rate than just 
using the approximation p{u\^') = l/\S^>\ discussed in Section UlI-Bll Observe firstly that when X' is 
deterministic the situation is equivalent to a non-side informed discrete channel with inputs and 
\X\^ outputs, with a transition probability matrix A whose rows are the rows of n™ corresponding to 
the codons associated with In general this channel will not be symmetric nor weakly symmetric, since 
although its rows are permutations of the same set of probabilities, its columns are not, and their sum is 
not constant either. However //(Z(„)|U) is still independent of the distribution of U, and then we only 
need to maximise i7(Z(^)) to find capacity. The corresponding conditions for the maximum are 

^ p(z|v)logp(z) + 1 = /5, (19) 

for V G 5^', and with p a Lagrange multiplier. 

Using logx < X — 1 and p{z) = X]ug5j/ p(^|u)p(u|^'), we can write 

p(z|v) p{z\u)pin\e) < p, (20) 

for V G S^'. Our approximation consists of solving p(u|^') by enforcing equality in (l20l ) for all v G S^'. 
This yields the linear system 

TT (AA'^) = pi, (21) 

where the probabilities p{u\^'), with u G S^>, axe the elements of the 1 x \S^>\ vector tt (arranged in 
the same codon order as the rows of A), and 1 is an all-ones vector of size 1 x \S^>\. Since tt must be 
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Fig. 13. Comparison of maximising p(u|^') distributions for the deterministic case ^' = Leu {q = 10~^ m = 100, 7 = 0.1) 

a pmf, we may fix any arbitrary value of p, such as p = 1, and then normalise the solution tt to the 
resulting linear system, that is 

^ = 1(AA^)"^ (22) 

The matrix AA^ is invertible if both q ^ 1/(47/3) and (7/1/(2(1 — 7/3)) because in this case the 
rows of A are linearly independent. This is due to the fact that under the two conditions above the rows 
of n"* are linearly independent, since its eigenvalues are all the possible products of three eigenvalues 
of n'" |[T6l and the conditions above guarantee that these are nonzero. A sufficient condition for the 
invertibility of AA^ is g < 1/2, which spans most cases of interest. 

Since we have linearised the optimisation problem then vf may contain negative values, but in practice 
these are relatively small. Setting these values to zero and normalising tt we obtain an approximation 
to the optimum distribution p{\i\^'). An example of this approximation compared to the results of the 
Blahut-Arimoto algorithm is shown in Figure [T3l 
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